Extracting Aggregate Answer Statistics for Integration

Authors

  • Zainab Zolaktaf
  • Jian Xu
  • Rachel Pottinger
Abstract

Aggregate queries in integration contexts often do not have one “true” answer; there can be multiple correct answers for the same aggregate query. This is due to the existence of duplicate or overlapping data points, possibly with different values, across the data sources. Depending on which combination of data sources is used to answer the query, different answers can be generated. Thus, representing the answer to the aggregate query as an answer distribution instead of a single scalar value allows users to better understand the range of possible answers. This work provides a suite of methods for extracting statistics that convey meaningful information about aggregate query answers in heterogeneous integration settings. We focus on the following challenges: (1) determining which statistics best represent an answer’s distribution; and (2) efficiently computing the desired statistics. Our solution includes the following answer statistics: (1) a set of point estimates with confidence intervals; (2) a high coverage interval that unveils “hot areas” in a distribution; and (3) a stability score that measures the impact of source dynamics. We optimize the extraction of this statistical information by minimizing the sampling load and applying fast approximate algorithms. We verify the effectiveness and efficiency of our methods with empirical studies using real-life and synthetic, scaled data sets.
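To make the idea of an answer distribution concrete, the following is a minimal illustrative sketch, not the authors' algorithm. It assumes a toy setting in which each source maps entity keys to values, a random non-empty combination of sources is drawn per sample, and a duplicated key is resolved by picking one of its reported values at random. The data in `SOURCES`, the function names, and the normal-approximation confidence interval are all assumptions introduced here for illustration; the stability score is omitted because the abstract does not define it.

```python
import math
import random
import statistics

# Hypothetical toy data: each source maps an entity key to a reported value.
# Overlapping keys with conflicting values make the aggregate answer ambiguous.
SOURCES = {
    "s1": {"a": 10.0, "b": 20.0, "c": 30.0},
    "s2": {"a": 12.0, "b": 20.0, "d": 5.0},
    "s3": {"b": 25.0, "c": 30.0, "d": 7.0},
}


def sample_sum(sources, rng):
    """Draw one possible SUM answer from a random non-empty source combination."""
    names = sorted(sources)
    chosen = rng.sample(names, rng.randint(1, len(names)))
    keys = sorted({key for name in chosen for key in sources[name]})
    total = 0.0
    for key in keys:
        candidates = [sources[name][key] for name in chosen if key in sources[name]]
        # Assumed conflict resolution: pick one of the reported values at random.
        total += rng.choice(candidates)
    return total


def answer_distribution(sources, n_samples=2000, seed=0):
    """Collect sampled SUM answers; the seed makes the sketch reproducible."""
    rng = random.Random(seed)
    return [sample_sum(sources, rng) for _ in range(n_samples)]


def mean_with_ci(samples, z=1.96):
    """Point estimate (mean) with a normal-approximation confidence interval."""
    mean = statistics.fmean(samples)
    half = z * statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, (mean - half, mean + half)


def shortest_coverage_interval(samples, coverage=0.9):
    """Shortest interval containing at least `coverage` of the sampled answers."""
    ordered = sorted(samples)
    n = len(ordered)
    window = max(1, math.ceil(coverage * n))
    best = min(range(n - window + 1),
               key=lambda i: ordered[i + window - 1] - ordered[i])
    return ordered[best], ordered[best + window - 1]


if __name__ == "__main__":
    answers = answer_distribution(SOURCES)
    print("mean and CI:", mean_with_ci(answers))
    print("90% coverage interval:", shortest_coverage_interval(answers))
```

In this sketch the shortest interval that covers a chosen fraction of the sampled answers plays the role of a "hot area": it shows where most plausible answers concentrate, which a single scalar answer would hide.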

Related resources

WORK IN PROGRESS: Data Explorer – Assessment Data Integration, Analytics, and Visualization for STEM Education Research

We describe a comprehensive system for comparative evaluation of uploaded and preprocessed data in physics education research with applicability to standardized assessments for discipline-based education research, especially in science, technology, mathematics, and engineering. Views are provided for inspection of aggregate statistics about student scores, comparison over time within one course,...

A Dynamic Method for Answering Ad Hoc Continuous Aggregate Queries

Data streams are infinite, fast sequences of time-stamped data elements that arrive in bursts. Generally, these elements need to be processed online and in real time, so algorithms that process data streams and answer queries on them are mostly one-pass. Executing such algorithms poses challenges such as memory limitations, scheduling, and accuracy of answers. They will be more ...

Scaling the walls of discovery: using semantic metadata for integrative problem solving

Current data integration approaches by bioinformaticians frequently involve extracting data from a wide variety of public and private data repositories, each with a unique vocabulary and schema, via scripts. These separate data sets must then be normalized through the tedious and lengthy process of resolving naming differences and collecting information into a single view. Attempts to consolida...

Strategic Human Resource Development Model Designing in National Iranian Oil Company

Today, human resource strategies are very important for human resource systems. When talking about strategies, integration and coordination are much more important than strategy formulation and implementation. This study presents a model for the strategic development of human resources based on a competency model. In fact, the main question of this research is: What are the affec...

Extended aggregations for databases with referential integrity issues

Querying databases with incomplete or inconsistent content remains a broad and difficult problem. In this work, we study how to improve aggregations computed on databases with referential errors in the context of database integration, where each source database has different tables, columns with similar content across multiple databases, but different referential integrity constraints. Thus, a ...

Publication date: 2015